3.5.5 Forward Propagation Based on Projection Convolution Layer
For each full-precision kernel $C_i^l$, the corresponding quantized kernels $\hat{C}_{i,j}^l$ are concatenated to construct the kernel $D_i^l$ that actually participates in the convolution operation as
$$D_i^l = \hat{C}_{i,1}^l \oplus \hat{C}_{i,2}^l \oplus \cdots \oplus \hat{C}_{i,J}^l, \tag{3.45}$$
where $\oplus$ denotes the concatenation operation on the tensors. In PCNNs, the projection convolution is implemented based on $D^l$ and $F^l$ to calculate the feature map $F^{l+1}$ of the next layer:
$$F^{l+1} = \mathrm{Conv2D}(F^l, D^l), \tag{3.46}$$
where $\mathrm{Conv2D}$ is the traditional 2D convolution. Although our convolutional kernels are 3D-shaped tensors, we design the following strategy to fit the traditional 2D convolution:
$$F_{h,j}^{l+1} = \sum_{i,h} F_h^l \otimes D_{i,j}^l, \tag{3.47}$$
$$F_h^{l+1} = F_{h,1}^{l+1} \oplus \cdots \oplus F_{h,J}^{l+1}, \tag{3.48}$$
where $\otimes$ denotes the convolution operation. $F_{h,j}^{l+1}$ is the $j$th channel of the $h$th feature map at the $(l+1)$th convolutional layer, and $F_h^l$ denotes the $h$th feature map at the $l$th convolutional layer. To be more precise, when $h = 1$, for example, the $j$th channel of an output feature map, $F_{1,j}^{l+1}$, is the sum of the convolutions between all $h$ input feature maps and the $i$ corresponding quantized kernels. All channels of the output feature map, $F_{h,1}^{l+1}, \ldots, F_{h,j}^{l+1}, \ldots, F_{h,J}^{l+1}$, are obtained in this way and concatenated to construct the $h$th output feature map $F_h^{l+1}$.
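To make the forward pass of Eqs. 3.45-3.48 concrete, below is a minimal PyTorch sketch; the framework choice, the function name projection_conv2d, and the tensor layout are assumptions for illustration, not the authors' released implementation. It takes quantized kernels $\hat{C}_{i,j}^l$, assumed to be already produced by the projections of Eq. 3.44, concatenates them into $D^l$, and applies a single standard 2D convolution.

```python
import torch
import torch.nn.functional as F

def projection_conv2d(feature_map, quantized_kernels, stride=1, padding=1):
    """Forward pass of the projection convolution layer (Eqs. 3.45-3.48), sketched.

    feature_map:       F^l, shape (batch, C_in, H, W)
    quantized_kernels: hat{C}^l_{i,j}, shape (I, J, C_in, k, k), assumed to be
                       already quantized by the projections of Eq. 3.44
    """
    I, J, C_in, k, _ = quantized_kernels.shape

    # Eq. 3.45: D^l_i = hat{C}^l_{i,1} (+) ... (+) hat{C}^l_{i,J}.
    # Stacking the J quantized copies along the output-channel axis lets a
    # single traditional 2D convolution produce all output channels at once.
    D = quantized_kernels.reshape(I * J, C_in, k, k)

    # Eq. 3.46: F^{l+1} = Conv2D(F^l, D^l).
    out = F.conv2d(feature_map, D, stride=stride, padding=padding)

    # Eqs. 3.47-3.48: view the result as I output feature maps, each made of
    # its J concatenated channels F^{l+1}_{h,1}, ..., F^{l+1}_{h,J}.
    batch, _, H, W = out.shape
    return out.reshape(batch, I, J, H, W)

# Toy usage with binary {-1, +1} stand-in kernels and J = 4 projections.
x = torch.randn(2, 64, 32, 32)                    # F^l
hatC = torch.sign(torch.randn(64, 4, 64, 3, 3))   # hat{C}^l_{i,j}
y = projection_conv2d(x, hatC)                    # shape (2, 64, 4, 32, 32)
```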
It should be emphasized that we can utilize multiple projections to increase the diversity of the convolutional kernels $D^l$. However, even a single projection performs much better than existing BNNs. What is essential is the use of DBPP, which differs from [147], where a single quantization scheme is used. Within our convolutional scheme, there is no dimension mismatch between the feature maps and kernels of two successive layers. Thus, we can replace the traditional convolutional layers with ours to binarize widely used networks, such as VGGs and ResNets. At inference time, we only store the set of quantized kernels $D_i^l$ instead of the full-precision ones; that is, the projection matrices $W_j^l$ are not used for inference, which reduces storage.
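As a sketch of how such a layer could stand in for a standard convolution at inference time, the module below stores only the concatenated quantized kernels $D^l$ as a buffer, so neither the full-precision kernels $C_i^l$ nor the projection matrices $W_j^l$ need to be kept. PyTorch, the class name, and the constructor interface are assumptions made for this illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class ProjectionConv2dInference(nn.Module):
    """Inference-time projection convolution: only the quantized D^l is stored."""

    def __init__(self, quantized_kernels, stride=1, padding=1):
        super().__init__()
        I, J, C_in, k, _ = quantized_kernels.shape
        # A buffer is serialized with the model but never touched by the
        # optimizer, matching the claim that W^l_j is not needed at inference.
        self.register_buffer("D", quantized_kernels.reshape(I * J, C_in, k, k))
        self.stride, self.padding = stride, padding

    def forward(self, x):
        # Eq. 3.46, computed from the stored quantized kernels alone.
        return F.conv2d(x, self.D, stride=self.stride, padding=self.padding)
```

Such a module could then replace the corresponding convolutional layers of a VGG or ResNet, since, as noted above, the feature-map and kernel dimensions of successive layers remain consistent.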
3.5.6 Backward Propagation
According to Eq. 3.44, what should be learned and updated are the full-precision kernels $C_i^l$ and the projection matrix $W^l$ ($\hat{W}^l$), using the update equations described below.
Updating $C_i^l$: We define $\delta_{C_i}$ as the gradient of the full-precision kernel $C_i$, and have
$$\delta_{C_i^l} = \frac{\partial L}{\partial C_i^l} = \frac{\partial L_S}{\partial C_i^l} + \frac{\partial L_P}{\partial C_i^l}, \tag{3.49}$$
$$C_i^l \leftarrow C_i^l - \eta_1 \delta_{C_i^l}, \tag{3.50}$$
where $\eta_1$ is the learning rate for the convolutional kernels. More specifically, for each term in Eq. 3.49, we have
$$\frac{\partial L_S}{\partial C_i^l} = \sum_{j}^{J} \frac{\partial L_S}{\partial \hat{C}_{i,j}^l} \frac{\partial P_{\Omega_N}^{l,j}(\hat{W}_j^l, C_i^l)}{\partial (\hat{W}_j^l \circ C_i^l)} \frac{\partial (\hat{W}_j^l \circ C_i^l)}{\partial C_i^l} = \sum_{j}^{J} \frac{\partial L_S}{\partial \hat{C}_{i,j}^l} \circ \mathbf{1}_{-1 \le \hat{W}_j^l \circ C_i^l \le 1} \circ \hat{W}_j^l, \tag{3.51}$$
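As an illustration of Eq. 3.51, the following sketch computes $\partial L_S / \partial C_i^l$ from the incoming gradients with respect to the quantized kernels; the tensor shapes, the broadcasting of $\hat{W}_j^l$, and the function name are assumptions made only for this example.

```python
import torch

def grad_LS_wrt_C(grad_hatC, W_hat, C):
    """dL_S / dC^l_i as in Eq. 3.51, under assumed tensor layouts.

    grad_hatC: dL_S / d hat{C}^l_{i,j}, shape (I, J, C_in, k, k)
    W_hat:     projection matrices hat{W}^l_j, shape (J, C_in, k, k)
    C:         full-precision kernels C^l_i, shape (I, C_in, k, k)
    Returns:   dL_S / dC^l_i, shape (I, C_in, k, k)
    """
    W = W_hat.unsqueeze(0)   # (1, J, C_in, k, k), broadcast over kernels i
    Ci = C.unsqueeze(1)      # (I, 1, C_in, k, k), broadcast over projections j

    # Indicator 1_{-1 <= hat{W}^l_j o C^l_i <= 1}: the gradient passes only
    # where the projected value lies inside the quantization range.
    inside = ((W * Ci).abs() <= 1).to(grad_hatC.dtype)

    # Eq. 3.51: element-wise products, summed over the J projections.
    return (grad_hatC * inside * W).sum(dim=1)
```

The result would then be added to $\partial L_P / \partial C_i^l$ and used in the update of Eq. 3.50.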